Why can’t I run a 100-node CockroachDB cluster?

CockroachDB is designed to be a scalable, survivable, and strongly consistent SQL database. Building a distributed system with these capabilities is a big task. Beyond the required functionality, it must also be correct, performant, and stable, or it isn’t worth the bits used to copy the binary.

From the start, our approach was to focus first on correctness, as it always proves the most difficult to retrofit, next on performance for single nodes, and then on stability and performance for multi-node clusters. Since announcing our beta in March, we’ve made substantial progress on the first two points; over 2,000 commits to the GitHub repo have added an array of distributed SQL capabilities, made substantial performance improvements, and put significant work into distributed SQL execution.

Unfortunately, what’s been lost in the fast-paced shuffle is sustained focus on stability. Our Q3 goal of a 10-node cluster running under continuous load for two solid weeks has been maddeningly elusive.

Factors contributing to product instability include:

Components are intertwined with enough complexity that surprising emergent behavior must be dealt with.
Stability fixes are often under-designed, failing to address a problem adequately.
Fixes are also sometimes over-designed, adding unnecessary complexity.
Code churn (refactorings, new feature work, dependency upgrades) makes stability a moving target.
No engineers are focused on stability full-time.

A Focus on Stability

Stability has become our most pressing concern. To effectively address it, we’re internally designating it as a “code yellow.” This is a Googlism, intended to accord extra importance to an effort of singular importance to the company. Without stability, we don’t have a working database, so the priority here is appropriate. During a code yellow, any requests made on its behalf take immediate precedence: as in, “drop everything else and focus on stability!”

We’ve made several changes as part of the stability code yellow. First, we’ve split out a team to focus full-time; other team members will also be involved anywhere stability work overlaps their areas of expertise. Second, we’ve decided to stabilize the master branch in isolation; new feature development will continue in the develop branch. The develop branch will periodically merge stability fixes from master. Finally, merges to master will be tightly controlled by the stability team, who have responsibility for reviewing all changes, and for cherry-picking changes to develop which have stability overlap.

What it Means for Contributors

For contributors, working in develop instead of master may be disruptive, though that decision was made with intention. As long as this stability push is underway, we want contributors to consider the impact of their changes. In particular, avoid refactorings or dependency upgrades which affect the storage, rpc, or kv packages. The hope is we can minimize the complexity of downstream merging, as well as decrease the likelihood that those merges reintroduce instability.

What it Means for Users

Users of CockroachDB may have noticed our beta release schedule has slowed. We expect to resume weekly releases once we restore basic stability, likely next week. Future beta releases will primarily contain stability fixes, and will not include new features being added to the develop branch. To get access to new features in develop, you’ll need to build from source. The list of new features is small now, but it will continue to grow over the course of this code yellow.

With a lot of hard work and a little luck, we’ll have CockroachDB clusters running smoothly. Stay tuned for my next blog post, which will be a post-mortem of this stability code yellow with lessons learned.

Stability

Engineering